Tiffany Chan
Computer Vision Project Using Convolutional Neural Network Model
Plant Seedlings Classification
1. Import the libraries, load dataset, print shape of data, visualize the images in dataset. (5 Marks).
Import the libraries
#Importing the necessary libraries for this project. There are others that are imported later on, as needed.
import cv2
import math
import numpy as np
import pandas as pd
from glob import glob
from matplotlib import pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D, GlobalMaxPooling2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.utils import to_categorical # convert to one-hot-encoding
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras import datasets, models, layers, optimizers
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.metrics import classification_report, confusion_matrix
Load data
The following code was provided to load the plant seedling photos and labels from Kaggle:
train_path = "/content/drive/MyDrive/Colab Notebooks/plant_seedlings/train.zip"
!mkdir temp_train
# Extract the files from the dataset into the temp_train folder (the dataset is a zip file).
from zipfile import ZipFile
with ZipFile(train_path, 'r') as zip_file: # Named zip_file to avoid shadowing the built-in zip()
    zip_file.extractall('./temp_train')
path = "./temp_train/*/*.png" # The path to all images in training set. (* means include all folders and files.)
files = glob(path)
trainImg = [] # Initialize empty list to store the image data as numbers.
trainLabel = [] # Initialize empty list to store the labels of images
j = 1
num = len(files)
# Read and resize each image, and take its label from the folder name.
for img in files:
    # Append the image data to the trainImg list.
    # Append the label to the trainLabel list.
    print(str(j) + "/" + str(num), end="\r")
    trainImg.append(cv2.resize(cv2.imread(img), (128, 128))) # Read image (OpenCV returns BGR channel order) and resize to 128x128
    trainLabel.append(img.split('/')[-2]) # Get image label (the folder name is the class to which the image belongs)
    j += 1
trainImg = np.asarray(trainImg) # Train images set
trainLabel = pd.DataFrame(trainLabel) # Train labels set
Print shape of data
#Shape of data
print(trainImg.shape)
print(trainLabel.shape)
We can see from the shape of the data that there are 4750 photos of plant seedlings in total. Each photo is 128x128 pixels. The final dimension of 3 means that these photos are in color, with three color channels (OpenCV loads them in BGR order rather than RGB).
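To make the shape easier to read, here is a minimal sketch using a tiny made-up array in place of trainImg. The array contents and the name batch_bgr are assumptions for illustration only. It also shows that reversing the last axis turns OpenCV's BGR order into the RGB order that plt.imshow expects.

```python
import numpy as np

# Hypothetical stand-in for trainImg: 4 tiny "images" of 2x2 pixels with 3 channels.
batch_bgr = np.zeros((4, 2, 2, 3), dtype=np.uint8)
batch_bgr[..., 0] = 255  # channel 0 is blue in OpenCV's BGR order

# The shape reads as (number of images, height, width, channels).
n_images, height, width, channels = batch_bgr.shape
print(n_images, height, width, channels)

# Reversing the last axis converts BGR to RGB.
batch_rgb = batch_bgr[..., ::-1]
print(batch_rgb[0, 0, 0])  # the blue value now sits in the last position
```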
#The frequency values of the different plant seedlings.
trainLabel.value_counts()
#A look at how trainLabel is ordered.
#Because photos of the same species follow one another, we will need to shuffle them in order to reduce ordering bias.
#train_test_split shuffles by default, and shuffle=True can also be passed when fitting the model later.
trainLabel.head()
Visualize the images in the dataset
#Let's look at some pictures to make sure they got imported correctly.
i = 0
image = trainImg[i]
label = trainLabel.iloc[i, 0] # The label is still a string (the folder name) because the labels haven't been encoded yet.
print(label)
plt.imshow(image[..., ::-1]); # OpenCV loads images as BGR, so reverse the channels to RGB for display.
#Let's look at image 122. (This is an arbitrary pick to check that other pictures display too.)
i = 122
image = trainImg[i]
plt.imshow(image[..., ::-1]); # Reverse BGR (OpenCV order) to RGB for display.
These photos are beautiful! To the computer, however, each photo is just an array of pixel values that range from 0 to 255, and we will need to rescale those values before the model can learn from them well.
2. Data Pre-processing: (15 Marks): a. Normalization, b. Gaussian Blurring, c. Visualize data after pre-processing.
2A.
#a. Normalization:
# Normalizing the data is important for neural networks. We convert the pixel values to floats so that the division yields decimal values.
X = trainImg.astype('float32') / 255.0 # 255 is the maximum value of each color channel. A pixel with 255 in every channel is white, while one with 0 in every channel is black. All other colors and variations fall in between.
#Let's check to see if normalization worked.
X
Yes, normalization worked. Values are floats. Values range between 0 and 1.
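A quicker way to confirm this than reading the printed array is to check the dtype, minimum and maximum. This is a minimal sketch on random made-up pixel data (the array raw is an assumption standing in for trainImg).

```python
import numpy as np

# Hypothetical stand-in for trainImg: random 8-bit pixel values.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(5, 8, 8, 3), dtype=np.uint8)

# Same normalization as above: cast to float32, then divide by 255.
scaled = raw.astype("float32") / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```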
2B.
# b. Gaussian Blurring
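from scipy.ndimage import gaussian_filter # scipy.ndimage.filters is deprecated
# X has shape (images, height, width, channels). Blur only along height and width;
# a single scalar sigma would also blur across neighboring images and color channels.
blurred = gaussian_filter(X, sigma=(0, 0.8, 0.8, 0))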
gaussian_filter is one way to apply Gaussian blurring. I liked this function because it has a sigma value that can be tuned to improve model performance. I tried several values and found that 0.8 gave the best results.
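To get a feel for what sigma does, here is a minimal sketch of the 1-D Gaussian weights for the three sigma values I tried. The helper gaussian_kernel_1d and the radius of 3 are my own assumptions for illustration; this is not how scipy builds or truncates its kernel internally.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian weights on the integer offsets -radius..radius."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum()

# A larger sigma spreads more weight to the neighbors, which means a stronger blur.
for sigma in (0.6, 0.8, 1.0):
    kernel = gaussian_kernel_1d(sigma, radius=3)
    print(sigma, np.round(kernel, 3))
```

The center weight shrinks as sigma grows, so each blurred pixel borrows more from its neighbors.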
2C.
#c. Visualize data after pre-processing.
#Let's see how #122 looks compared to the original photo above.
i = 122
image_blur122 = blurred[i]
plt.imshow(image_blur122[..., ::-1]); # Reverse BGR (OpenCV order) to RGB for display.
Gaussian blurring was a success!
We could have also converted the images to grayscale, but it wasn't really necessary. It might have shortened the time needed to train the model.
3. Make data compatible: (10 Marks): a. Split the dataset into training, testing, and validation sets. (Hint: First split train images and train labels into training and testing set with test_size = 0.3. Then further split test data into test and validation set with test_size = 0.5) b. Reshape data into shapes compatible with Keras models. c. Convert labels from digits to one hot vectors. d. Print the label for y_train[0].
3A.
#Splitting dataset into training, testing and validation sets.
from sklearn.model_selection import train_test_split #Also listed in the first cell, but I re-wrote this for convenience.
#Splitting data into training (70%) and valtest (30%) datasets.
x_train, x_valtest, y_train, y_valtest = train_test_split(blurred, trainLabel, test_size=0.30, random_state=1)
#Splitting valtest in half: 50% validation and 50% test (each is 15% of the full dataset).
x_val, x_test, y_val, y_test = train_test_split(x_valtest, y_valtest, test_size=0.5, random_state=1)
#Checking that the split sizes are correct. We also need these shapes when reshaping for Keras compatibility.
print(x_train.shape)
print(x_val.shape)
print(x_test.shape)
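The arithmetic behind the 70/15/15 split can be sketched with plain index shuffling. This is only an illustration of the idea under my own assumptions (integer rounding, a NumPy random generator); it is not how train_test_split is implemented.

```python
import numpy as np

n_samples = 4750  # number of images reported by trainImg.shape
rng = np.random.default_rng(1)
indices = rng.permutation(n_samples)  # shuffle before splitting

n_train = n_samples * 70 // 100       # 70% for training
n_val = (n_samples - n_train) // 2    # half of the remaining 30% for validation

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]  # whatever is left goes to the test set

print(len(train_idx), len(val_idx), len(test_idx))
```

Every image lands in exactly one of the three sets, which is the property that matters for an honest test score.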
3B.
#Reshaping data into shapes compatible with Keras models.
X_train = x_train.reshape(x_train.shape[0], 128, 128, 3)
X_val = x_val.reshape(x_val.shape[0], 128, 128, 3)
X_test = x_test.reshape(x_test.shape[0], 128, 128, 3)
#Preparing the labels for one-hot encoding.
#The target variables (y_train, y_val, and y_test) are still strings because I haven't done anything to them yet.
#Since the question asks to convert digits to one hot vectors, we first encode the strings as numerical categories, and then apply one-hot-encoding to those digits.
#(pandas get_dummies could turn the strings into one hot vectors directly, but that would skip the digit step the question asks for.)
#Let's look at the state of our target variable first before we continue pre-processing.
y_train
#Let's rename the target column to 'seed' instead of 0 to avoid confusion with the numeric code of each seedling class.
y_train.columns = ['seed']
print(y_train) #Checking to see that we changed the name of the variable to avoid confusion
#Repeat for y_val and y_test
y_val.columns = ['seed']
y_test.columns = ['seed']
3C.
# c. Convert labels from digits to one hot vectors.
#Encode the string categories as numerical categories, because the question asks to convert digits to one hot vectors, not strings to one hot vectors.
#The seedling digits will be assigned alphabetically.
#Black-grass will be 0, and sugar beet will be 11.
#The variable is a single-column DataFrame and must be flattened into a 1-D array.
y_train = np.ravel(y_train)
#Set label encoder
from sklearn import preprocessing #Re-wrote this for convenience
le = preprocessing.LabelEncoder()
le.fit(y_train)
list(le.classes_) #Looking at the different classes and their assigned order
#Transform y_train into digits. Also check to see it was done properly.
y_train = le.transform(y_train)
y_train
#Repeat for y_val and y_test.
#Reuse the same label encoder to get the same order of the categories.
y_val = le.transform(np.ravel(y_val))
y_test = le.transform(np.ravel(y_test))
#Employ LabelBinarizer to transform the single column of digit categories into one hot vectors.
from sklearn.preprocessing import LabelBinarizer #Re-wrote this for convenience
enc = LabelBinarizer()
y_train = enc.fit_transform(y_train)
#Only transform (do not fit again) so that y_val and y_test use the same columns as y_train.
y_val = enc.transform(y_val)
y_test = enc.transform(y_test)
3D.
#d. Print out y_train[0]
y_train[0]
This shows that the first training image belongs to the class in the 4th position of the class list above (index 3), which is common chickweed.
#Just checking the dtype of y_train. Although the labels are categorical in theory, they are stored as integers.
y_train.dtype
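The two-step conversion above (strings to digits, then digits to one hot vectors) can also be sketched in plain NumPy. This is an illustrative equivalent on a handful of made-up labels, not the scikit-learn code used in this notebook.

```python
import numpy as np

# Hypothetical labels; the names mirror seedling classes mentioned in this notebook.
labels = np.array(["Maize", "Black-grass", "Sugar beet", "Maize", "Black-grass"])

# Step 1: strings -> digits. np.unique sorts the class names alphabetically.
classes, digits = np.unique(labels, return_inverse=True)

# Step 2: digits -> one hot vectors, one column per class.
one_hot = np.eye(len(classes), dtype=int)[digits]

print(classes)
print(digits)
print(one_hot[0])

# argmax along each row recovers the digit from the one hot vector.
recovered = one_hot.argmax(axis=1)
```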
4. Building CNN (15 Marks): a. Define layers. b. Set optimizer and loss function. (Use Adam optimizer and categorical crossentropy.)
The biggest challenge was dealing with overfit and underfit models. Here, I discuss what I did to combat these issues.
Tuning the following hyperparameters and associated items:
Limiting the number of convolutional layers: I initially had 4 convolutional layers in my CNN model. Every time I ran the model with that many layers, it would reach really high training performance but would bomb on the validation data. So, I opted for a less complicated model by paring it down to only 2 convolutional layers with accompanying pooling layers, followed by a simple dense network with not too many neurons.
Adding Dropout to each section of the CNN model: Dropout randomly zeroes a fraction of a layer's outputs during training, so the network cannot rely too heavily on any single neuron. It can follow the dense layers as well as the convolutional layers. I found this very helpful with layers that had excessive parameters (weights), especially the dense layer. I also discovered that setting too many Dropouts to 0.5 limits the model's potential: the training accuracy ended up around 50% during each epoch, and the underfit model did not learn enough to make decent predictions. I learned that giving higher dropout to layers with excessive parameters, and lower dropout to layers with a manageable number of weights, handles overfitting better.
Penalizing the parameters (weights) by using a regularizer like L1 in the dense layer: After looking at the model summary, I realized the dense layer had excessive parameters. I wanted a way to control the weights and found that regularizers penalize large weights. Using one regularizer on the layer with excessive weights helped minimize the overfitting issues.
Other: Average pooling vs. max pooling: In this scenario, max pooling gave higher accuracy and lower loss.
Sigma: Changing the sigma value in Gaussian blurring also helped. I tried 0.6, 0.8 and 1, and 0.8 yielded the best outcomes of the three.
Increasing the number of epochs and changing the batch size also improved results. After many attempts, 35 epochs with a batch size of 35 seemed to be ideal.
ReLU vs. sigmoid: ReLU trained faster (1 minute vs. 4 minutes). It has a tendency to converge faster than sigmoid, and with ReLU we don't have to worry as much about a vanishing gradient.
Changing the number of filters and the filter dimensions was also attempted. 32 filters of size 3x3 in each convolutional layer seemed to yield the best results.
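To make the dropout and L1 ideas above concrete, here is a minimal NumPy sketch of both. The helper names dropout and l1_penalty and the example arrays are my own assumptions for illustration; the Keras layers handle all of this internally during training.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of the activations, rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def l1_penalty(weights, strength):
    """L1 regularization term added to the loss: strength * sum(|w|)."""
    return strength * np.abs(weights).sum()

rng = np.random.default_rng(0)
activations = np.ones((1000, 20))              # hypothetical layer outputs
dropped = dropout(activations, rate=0.5, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0))
print("mean after dropout:", dropped.mean())   # stays close to the original mean of 1.0

weights = np.array([0.5, -0.25, 0.0, 1.0])     # hypothetical dense-layer weights
print("L1 penalty:", l1_penalty(weights, strength=0.01))
```

Note that dropout acts on a layer's outputs, while the L1 term acts on the weights themselves, which is why the two can be combined on the same dense layer.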
#Building the model.
#This is the best model I came up with.
from tensorflow.keras import datasets, models, layers, optimizers #Re-wrote this for convenience
from tensorflow.keras.layers import Conv2D #Re-wrote this for convenience
from tensorflow.keras import regularizers #Re-wrote this for convenience
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=3, activation="relu", input_shape=(128, 128, 3)))
model.add(layers.BatchNormalization()) #Normalizes the outputs of this layer before they proceed to the next layer
model.add(layers.MaxPooling2D((2, 2))) #Keeps the maximum value in each 2x2 neighborhood of the feature map.
model.add(layers.Dropout(0.2)) #Randomly zeroes 20% of the layer outputs during training
model.add(Conv2D(filters=32, kernel_size=3, activation="relu"))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.5)) #Randomly zeroes 50% of the layer outputs during training
model.add(Flatten())
model.add(Dense(20, activation="relu", kernel_regularizer=regularizers.l1(0.01))) #Additional L1 regularizer to penalize the weights in the layer
model.add(Dropout(rate=0.1)) #Randomly zeroes 10% of the layer outputs during training
model.add(Dense(12, activation="softmax")) #Use softmax activation function for multiclass classification
model.summary()
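The claim that the dense layer carries most of the weights can be checked by hand. This is a rough sketch of the arithmetic for the model defined above, assuming the Keras defaults of 'valid' padding and stride 1 for Conv2D; it leaves out the small BatchNormalization parameter counts.

```python
def conv_output(size, kernel):
    """Output width/height of a 'valid' convolution with stride 1."""
    return size - kernel + 1

side = 128
side = conv_output(side, 3)   # first Conv2D: 126
side = side // 2              # first 2x2 max pooling: 63
side = conv_output(side, 3)   # second Conv2D: 61
side = side // 2              # second 2x2 max pooling: 30

flattened = side * side * 32             # features entering the dense layer
conv1_params = (3 * 3 * 3 + 1) * 32      # kernel weights + one bias per filter
conv2_params = (3 * 3 * 32 + 1) * 32
dense_params = flattened * 20 + 20       # weights + biases of Dense(20)
output_params = 20 * 12 + 12             # weights + biases of Dense(12)

print("flattened features:", flattened)
print("conv params:", conv1_params, conv2_params)
print("dense params:", dense_params)
print("output params:", output_params)
```

Under these assumptions the Dense(20) layer holds well over half a million weights, far more than both convolutional layers combined, which is why it was the natural target for the L1 regularizer and the heavier dropout before it.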
#Select the hyperparameters for the optimizer
opt = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
# Compile the model. Use Adam and categorical crossentropy because it was in the question.
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy']) # Pass opt so the settings above are actually used.
# We set early stopping so that we don't lose time if val_loss fails to improve by at least 0.001 for 10 consecutive epochs.
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping #Re-wrote this for convenience
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=10)
#Adding ModelCheckpoint saves the weights whenever val_loss achieves a lower value.
model_checkpoint = ModelCheckpoint('seedlings_cnn_checkpoint_{epoch:02d}_loss{val_loss:.4f}.h5',
                                   monitor='val_loss',
                                   verbose=1,
                                   save_best_only=True,
                                   save_weights_only=True,
                                   mode='auto',
                                   save_freq='epoch') # Replaces the deprecated period=1
#Training the model on the train data and validation sets.
history = model.fit(X_train,
                    y_train,
                    batch_size=35, #35 for batch size was found to be ideal
                    epochs=35, #35 epochs was the best, with early stopping, of course.
                    validation_data=(X_val, y_val),
                    shuffle=True, #Shuffle, in order to limit the bias.
                    verbose=1,
                    callbacks=[early_stopping, model_checkpoint])
# Plot the training history so that we can track how the loss changes from epoch to epoch.
# Plotting train vs. validation loss.
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()
Looking at the plot of the train and validation loss, you can see that the validation loss (orange line), although not consistent from epoch to epoch, still follows the overall downward trend of the training loss (blue line). This is good.
Also, training stopped after epoch 24 of 35, because the validation loss had not improved for 10 consecutive epochs.
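The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the patience and min_delta idea with made-up loss values and a hypothetical helper name; it is not the actual Keras EarlyStopping implementation.

```python
def early_stop_epoch(val_losses, min_delta, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # counts as an improvement
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: the last real improvement arrives at epoch 3.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.6995, 0.75]
stop = early_stop_epoch(losses, min_delta=0.001, patience=3)
print(stop)
```

In this toy run the loss at epoch 6 is slightly lower than at epoch 3, but by less than min_delta, so it does not reset the counter. By the same logic, a stop at epoch 24 with a patience of 10 would suggest the last meaningful improvement came around epoch 14.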
#Train and validation accuracy line graph
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.legend()
plt.show()
The same goes for the accuracy line graph. Although the validation accuracy started out very low and was slow to improve, the validation line (orange) heads in the same general direction as the training line (blue), despite the fluctuations at individual epochs.
#Training Accuracy
scores = model.evaluate(X_train, y_train, verbose=1)
print('Train loss:', scores[0])
print('Train accuracy:', scores[1])
#Validation Accuracy
scores = model.evaluate(X_val, y_val, verbose=1)
print('Validation loss:', scores[0])
print('Validation accuracy:', scores[1])
#Test Accuracy
scores = model.evaluate(X_test, y_test, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
Results were good. Train accuracy: 80%, Validation accuracy: 79% and Test accuracy: 80%.
Overall, this CNN model is not overfit, and it was able to deliver a comparable performance on unseen data.
I ran many versions of the CNN model using the training and validation datasets to tune the hyperparameters. The biggest challenge was dealing with overfit and underfit models. I also ran the same model a second time and got slightly better results (Training accuracy: 88%, Validation accuracy: 86%, Test accuracy: 83%). Please see the extra part at the end if interested.
5. Fit and evaluate model and print confusion matrix. (10 Marks)
#Let's get the predictions for all of X_test first.
predictions = model.predict(X_test) #This provides the probabilities of each image for each label.
rounded_predictions = np.argmax(predictions, axis = -1) #This provides the prediction labels from 0-11.
#We need to make sure that the predictions are in the right format: the same one hot array of 1s and 0s as y_test.
rounded_predictions #This is each seedling's predicted class (0-11) in the test data
#We have to convert the predicted digits to 0s and 1s like we did for y_train, y_val and y_test.
#Reuse the fitted LabelBinarizer and only transform, so the columns stay aligned with y_test even if some class is never predicted.
y_pred = enc.transform(rounded_predictions)
y_pred
# Create a multiclass confusion matrix.
# In order to create the confusion matrix, we need to use argmax because confusion_matrix does not accept one-hot-encoded entries.
from sklearn.metrics import classification_report, confusion_matrix #Re-wrote this for convenience
cm = confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
cm
Confusion Matrix:
Columns = Prediction
Rows = Actual
As you can tell, the model performed decently on the unseen test data. You can see high values running along the diagonal (from upper left to lower right), which suggests good predictions.
However, there are some seedlings that are heavily misclassified. 32 cases were misclassified as loose silky-bent when they actually were black-grass. Also, 12 cases were predicted to be common chickweed when they were in fact maize. 13 common wheat and 12 fat hen seedlings were misclassified as loose silky-bent.
From these results, the model confuses several different species with loose silky-bent.
There are other single-digit residual misclassifications in the multiclass confusion matrix but they are not as significant as the ones just discussed.
#Let's look at the classification report. It will report the precision, recall and F1-score for each label.
print("=== Classification Report ===")
print(classification_report(y_test, y_pred))
Although accuracy is a good metric for judging the strength of the model, precision and recall are more informative. All precision values were at least 60%, which is decent. The recall results were worse: 3 plant seedlings (0: black-grass, 4: common wheat and 7: maize) had very low recall. With black-grass especially, the CNN model identified only 5% of all true black-grass seedlings, even though 67% of its black-grass predictions were correct. The CNN model also had low recall for common wheat and maize, correctly identifying only 39-40% of the images of these 2 species (this was made clear by the confusion matrix as well).
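Precision and recall can be read straight off a confusion matrix laid out as above (rows = actual, columns = prediction). Here is a minimal sketch on a made-up 3-class matrix; the numbers are hypothetical and only chosen so that the first class shows the same pattern as black-grass, with decent precision but very low recall.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual, columns = predicted.
cm = np.array([
    [ 2, 30,  0],
    [ 1, 90,  4],
    [ 0,  5, 40],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # share of each actual class that was found
precision = true_positives / cm.sum(axis=0)   # share of each predicted class that was right

print("recall:   ", np.round(recall, 2))
print("precision:", np.round(precision, 2))
```

The first class is rarely predicted at all, so the few predictions it does get are mostly right (high precision) while most of its true cases are absorbed by the second class (low recall).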
6. Visualize predictions for x_test[2], x_test[3], x_test[33], x_test[36], x_test[59]. (5 Marks)
#Let's look at the probabilities for X_test[2] as an example of how the model makes its predictions.
#Refer back to previous cell where we generated the predictions for the previous question.
predictions[2]
You can clearly see that the highest probability is 6.94922745e-01, which is in the 8th position (index 7, because Python starts counting at 0). So, we would expect 7 to come up when we use the argmax function to find the prediction.
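The step from a probability vector to a seedling name can be sketched as follows. The probability values are made up (only the peak at index 7 mirrors the output above), and the class list is my assumption of the alphabetical order the label encoder assigns, based on the dataset's folder names.

```python
import numpy as np

# Class names in alphabetical order (assumed from the dataset's folder names).
class_names = [
    "Black-grass", "Charlock", "Cleavers", "Common Chickweed", "Common wheat",
    "Fat Hen", "Loose Silky-bent", "Maize", "Scentless Mayweed",
    "Shepherds Purse", "Small-flowered Cranesbill", "Sugar beet",
]

# Hypothetical softmax output for one image, with its peak at index 7.
probabilities = np.full(12, 0.025)
probabilities[7] = 0.725

# argmax returns the index of the largest probability.
predicted_index = int(np.argmax(probabilities))
print(predicted_index, class_names[predicted_index])
```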
#This is the classification order for all the seedlings used from labelencoder previously.
list(le.classes_)
#For just finding what seedling type each image is predicted to be, use argmax.
#Please use the above order to classify each image.
rounded_predictions = np.argmax(predictions, axis = -1)
for i in [2, 3, 33, 36, 59]:
    print("X_test[{}]:".format(i))
    print(rounded_predictions[i])
    print("")
X_test[2] is predicted to be: Maize
X_test[3] is predicted to be: Loose Silky-Bent
X_test[33] is predicted to be: Small Flowered-Cranesbill
X_test[36] is predicted to be: Common Chickweed
X_test[59] is predicted to be: Loose Silky-Bent
#EXTRA: Let's also compare these predictions to the actual labels to see how many we got correctly.
for i in [2, 3, 33, 36, 59]:
    print("Actual label for image {}".format(i))
    print(y_test[i])
    print("")
Correct labels:
Image 2: Maize
Image 3: Loose silky-bent
Image 33: Small-flowered Cranesbill
Image 36: Common chickweed
Image 59: Black-grass
Of the 5 predictions, only image 59 was wrongly predicted. So, 1/5 was incorrect, which is in line with this CNN model's accuracy score of 80% on the test data.
EXTRA: I re-ran the same model and achieved these results. This is to see if the results are generalizable and similar to the ones achieved above.
#Accuracy line graph from a second trial of the model run above.
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.legend()
plt.show()
Clearly, the validation line follows the general trend of the training line. The only difference is that on this second run of the model, the validation loss was still improving detectably in the later epochs, so training continued through the 35th epoch.
Regular metrics for the same model (second time)
scores = model.evaluate(X_test, y_test, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
scores = model.evaluate(X_train, y_train, verbose=1)
print('Train loss:', scores[0])
print('Train accuracy:', scores[1])
scores = model.evaluate(X_val, y_val, verbose=1)
print('Validation loss:', scores[0])
print('Validation accuracy:', scores[1])
You can see that the metric scores are very similar to those of the original run. Accuracy on unseen data (test data) for this second trial is 83%, compared to 80% for the original trial. This goes to show that this model is generalizable, and can perhaps enter into production if an estimated 80-83% accuracy on testing is acceptable.
The second trial performed better in general. In terms of overfitting, the first trial shows no overfitting overall.